Data Analysis with R

Merging (Joining) datasets

  • Combine data of two datasets in one dataset.
  • Needed: Unique identifiers for observations (‘keys’).

Consider the following two data sets.

# load packages
library(tidyverse)

# initiate data frame on persons personal spending
df_c <- data.frame(id = c(1:3,1:3),
                   money_spent= c(1000, 2000, 6000, 1500, 3000, 5500),
                   currency = c("CHF", "CHF", "USD", "EUR", "CHF", "USD"),
                   year=c(2017,2017,2017,2018,2018,2018))
df_c
##   id money_spent currency year
## 1  1        1000      CHF 2017
## 2  2        2000      CHF 2017
## 3  3        6000      USD 2017
## 4  1        1500      EUR 2018
## 5  2        3000      CHF 2018
## 6  3        5500      USD 2018
# initiate data frame on persons' characteristics
df_p <- data.frame(id = 1:4,
                   first_name = c("Anna", "Betty", "Claire", "Diane"),
                   profession = c("Economist", "Data Scientist", "Data Scientist", "Economist"))
df_p
##   id first_name     profession
## 1  1       Anna      Economist
## 2  2      Betty Data Scientist
## 3  3     Claire Data Scientist
## 4  4      Diane      Economist

df_c contains information on person’s consumption spending, df_p information on these person’s personal characteristics. We can merge/join the two data frames via a unique person identifier in column id. There are four basic ways of joining the two datasets when not all of the observations are complete:

  1. The Inner Join: only complete observations are kept in the resulting new data frame.
df_merged <- merge(df_p, df_c, by="id")
df_merged
##   id first_name     profession money_spent currency year
## 1  1       Anna      Economist        1000      CHF 2017
## 2  1       Anna      Economist        1500      EUR 2018
## 3  2      Betty Data Scientist        2000      CHF 2017
## 4  2      Betty Data Scientist        3000      CHF 2018
## 5  3     Claire Data Scientist        6000      USD 2017
## 6  3     Claire Data Scientist        5500      USD 2018
  1. The Outer Join: all observations are kept, missing values are indicated with NAs.
df_merged <- merge(df_p, df_c, by="id", all=TRUE)
df_merged
##   id first_name     profession money_spent currency year
## 1  1       Anna      Economist        1000      CHF 2017
## 2  1       Anna      Economist        1500      EUR 2018
## 3  2      Betty Data Scientist        2000      CHF 2017
## 4  2      Betty Data Scientist        3000      CHF 2018
## 5  3     Claire Data Scientist        6000      USD 2017
## 6  3     Claire Data Scientist        5500      USD 2018
## 7  4      Diane      Economist          NA     <NA>   NA
  1. The Left Join: all observations in the “left” data frame (the first argument; x) are kept, but not all in the “right” data frame (the second argument, y). Again, missing values are indicated with NAs.
df_merged <- merge(df_p, df_c, by="id", all.x = TRUE, all.y = FALSE)
df_merged
##   id first_name     profession money_spent currency year
## 1  1       Anna      Economist        1000      CHF 2017
## 2  1       Anna      Economist        1500      EUR 2018
## 3  2      Betty Data Scientist        2000      CHF 2017
## 4  2      Betty Data Scientist        3000      CHF 2018
## 5  3     Claire Data Scientist        6000      USD 2017
## 6  3     Claire Data Scientist        5500      USD 2018
## 7  4      Diane      Economist          NA     <NA>   NA
  1. The Right Join: all observations in the “right” data frame (the first argument; x) are kept, but not all in the “left” data frame (the second argument, y). Again, missing values are indicated with NAs.
df_merged <- merge(df_p, df_c, by="id", all.x = FALSE, all.y = TRUE)
df_merged
##   id first_name     profession money_spent currency year
## 1  1       Anna      Economist        1000      CHF 2017
## 2  1       Anna      Economist        1500      EUR 2018
## 3  2      Betty Data Scientist        2000      CHF 2017
## 4  2      Betty Data Scientist        3000      CHF 2018
## 5  3     Claire Data Scientist        6000      USD 2017
## 6  3     Claire Data Scientist        5500      USD 2018

Merging (joining) datasets: the concept

Different approaches to merging two datasets: left join, right join, inner join, outer join.

Different approaches to merging two datasets: left join, right join, inner join, outer join.

Merging (joining) datasets in R

Overview by Wickham and Grolemund (2017):

dplyr (tidyverse) base::merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE),
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

Data Summaries: Selecting, Filtering, and Mutating

Data summaries

  • First step of analysis.
  • Get overview over dataset.
  • Show key aspects of data.
    • Inform your own statistical analysis.
    • Inform audience (helps understand advanced analytics parts)

Typical tasks related to the last preparatory steps for analytics or the computation of data summaries are:

  • Select subset of variables (e.g., for comparisons).
  • Filter the dataset (some observations not needed in this analysis).
  • Mutate the dataset: additional values needed

With their corresponding functions in tidyverse:

  • select()
  • filter()
  • mutate()

Examples: selection, filtering, mutating

summary() computes a few key summary statistics for each variable:

# summary statistics for each variable
summary(df_merged)
##        id        first_name         profession         money_spent     currency        
##  Min.   :1.00   Length:6           Length:6           Min.   :1000   Length:6          
##  1st Qu.:1.25   Class :character   Class :character   1st Qu.:1625   Class :character  
##  Median :2.00   Mode  :character   Mode  :character   Median :2500   Mode  :character  
##  Mean   :2.00                                         Mean   :3167                     
##  3rd Qu.:2.75                                         3rd Qu.:4875                     
##  Max.   :3.00                                         Max.   :6000                     
##       year     
##  Min.   :2017  
##  1st Qu.:2017  
##  Median :2018  
##  Mean   :2018  
##  3rd Qu.:2018  
##  Max.   :2018

table() computes the cross-tabulation of categorical variables:

# cross-tabulation (categorical variables)
table(df_merged$year)
## 
## 2017 2018 
##    3    3
table(df_merged$year, df_merged$profession)
##       
##        Data Scientist Economist
##   2017              2         1
##   2018              2         1

Hint: this is also very useful to detect potential inconsistencies in the coding of categorical variables.

Select a subset of columns/variables via select():

df_selection <- select(df_merged, id, year, money_spent, currency)
df_selection
##   id year money_spent currency
## 1  1 2017        1000      CHF
## 2  1 2018        1500      EUR
## 3  2 2017        2000      CHF
## 4  2 2018        3000      CHF
## 5  3 2017        6000      USD
## 6  3 2018        5500      USD

Filter for a subset of rows/observations via filter():

# one filtering condition
filter(df_selection, year == 2018)
##   id year money_spent currency
## 1  1 2018        1500      EUR
## 2  2 2018        3000      CHF
## 3  3 2018        5500      USD
# several filtering conditions
filter(df_selection, year == 2018, money_spent < 5000, currency=="EUR")
##   id year money_spent currency
## 1  1 2018        1500      EUR

Add/change variables (columns) via mutate(). In this example, we first create an additional data frame then add the data from this new data frame via merge() to the main data frame, and finally create a new version of the main data frame with a new column via mutate().

# initiate an exchange-rate data frame
exchange_rates <- data.frame(exchange_rate= c(0.9, 1, 1.2),
                             currency=c("USD", "CHF", "EUR"), stringsAsFactors = FALSE)
# join the exchange-rate with the main dataset
df_selection <- merge(df_selection, exchange_rates, by="currency")

# add the new variable with money spent in CHF
df_mutated <- mutate(df_selection, money_spent_chf = money_spent * exchange_rate)
df_mutated
##   currency id year money_spent exchange_rate money_spent_chf
## 1      CHF  1 2017        1000           1.0            1000
## 2      CHF  2 2017        2000           1.0            2000
## 3      CHF  2 2018        3000           1.0            3000
## 4      EUR  1 2018        1500           1.2            1800
## 5      USD  3 2017        6000           0.9            5400
## 6      USD  3 2018        5500           0.9            4950

Summary statistics and aggregation

Aggregate Statistics

In R, we can typically rely on two types of functions that help us compute descriptive statistics for a dataset.

  1. Functions to compute statistics (mean(), median()).
  2. Functions to apply the statistics function to one or several columns in a tidy dataset. Either for all values in a column, or separately by groups of observations (observation categories, e.g. by gender, year, location, etc.)

For example, summarise() and group_by() provided in tidyverse, as well as sapply(), apply(), lapply(), etc. provided in base-R.

Use summarise() to create a simple descriptive statistics table:

# generate a simple descriptive statistics table
summarise(df_mutated,
          mean = mean(money_spent_chf),
          standard_deviation = sd(money_spent_chf),
          median = median(money_spent_chf),
          N = n())
##   mean standard_deviation median N
## 1 3025           1788.785   2500 6

In case the descriptive statistics should be computed for a set of sub-groups, combine summarise() with group_by() as follows:

# summarize by groups of observations ("split-apply-combine")
# split into year-groups
by_year <- group_by(df_mutated, year)
# apply statistics functions, and combine results in one table
summarise(by_year,
          mean = mean(money_spent_chf),
          standard_deviation = sd(money_spent_chf),
          median = median(money_spent_chf),
          N = n())
## # A tibble: 2 × 5
##    year  mean standard_deviation median     N
##   <dbl> <dbl>              <dbl>  <dbl> <int>
## 1  2017  2800              2307.   2000     3
## 2  2018  3250              1590.   3000     3

Appart from the tidyverse-approach via summarise(), basic R provides a set of helpful apply-type functions to summarize datasets in a more flexible way. For example, we can use sapply() to compute the mean of each column in a data frame:

# apply-type functions for data summaries

# load data
data("swiss")

# compute the mean for each variable (column)
sapply(swiss, mean)
##        Fertility      Agriculture      Examination        Education         Catholic 
##         70.14255         50.65957         16.48936         10.97872         41.14383 
## Infant.Mortality 
##         19.94255

Data Analytics: a primer in linear regression/OLS

Some vocabulary

  • Dependent variable: \(y_i\).
  • Explanatory variable: \(x_i\).
  • “All the rest”: \(u_{i}\) (the ‘residuals’ or the ‘error term’).

A simple linear model: \(y_{i}= \alpha + \beta x_{i} + u_{i}\).

  • \(\alpha\) is the intercept. By adding an intercept to the linear model, we acknowledge that the value of \(y_i\) might not be 0 when \(x_i\) is 0.
  • \(\beta x_{i}\) is the slope coefficient. It’s value tells us by how much \(y_i\) changes if \(x_i\) is increased by one unit.

Illustration with pseudo-data

First, we define the key parameters for the simulation. We choose the actual values of \(\alpha\) and \(\beta\), and set the number of observations \(N\).

alpha <- 30
beta <- 0.9
N <- 1000

Now, we initiate a vector \(x\) of length \(N\) drawn from the uniform distribution (between 0 and 0.5). This will be our explanatory variable.

x <- runif(N, 0, 50)

Next, we draw a vector of random errors (residuals) \(u\) (from a normal distribution with mean=0 and SD=0.05) and compute the corresponding values of the dependent variable \(y\).

# draw the random errors (all the other factors also affecting y)
epsilon <- rnorm(N, sd=10)
# compute the dependent variable values
y <- alpha + beta*x + epsilon

Illustration with pseudo-data

plot(x,y)
abline(a = alpha, b=beta, col="red")

Illustration with pseudo data: “averaging”

# compute average y per x intervals
lower <- c(0,10,20,30,40)
upper <- c(lower[-1], 50)
n_intervals <- length(lower)
y_bars <- list()
length(y_bars) <- n_intervals
x_bars <- list()
length(x_bars) <- n_intervals
for (i in 1:n_intervals){
  y_bars[i] <- mean(y[lower[i] <= x & x < upper[i]])
  x_bars[i] <- mean(x[lower[i] <= x & x < upper[i]])

}
y_bars <- unlist(y_bars)
x_bars <- unlist(x_bars)

# add to plot
plot(x,y)
abline(a = alpha, b=beta, col="red")
points(x_bars, y_bars, col="blue", lwd=10)

Parameter estimation: “average out the \(u\)

Clearly, the average values are much closer to the real values. That is, we can ‘average out’ the \(u\) in order to get a good estimate for the effect of \(x\) on \(y\) (to get an estimate of \(\beta\)). With this understanding, we can now formalize how to compute \(\beta\) (or, to be more precise, an estimate of it: \(\hat{\beta}\)). For simplicity, we take \(\alpha=30\) as given.

In a first step we take the averages of both sides of our initial regression equation:

\(\frac{1}{N}\sum{y_i}=\frac{1}{N}\sum{(30 + \beta x_{i} + u_{i})}\),

rearranging and using the common ‘bar’-notation for means, we get

\(\bar{y}=30+\beta\bar{x} + \bar{u}\),

and solving for \(\beta\) and some rearranging then yields

\(\beta=\frac{\bar{y}-30-\bar{u}}{\bar{x}}\).

While the elements in \(\bar{u}\) are unobservable, we can use the rest to compute an estimate of \(\beta\):

\(\hat{\beta}=\frac{\bar{y}-30}{\bar{x}}\).

(mean(y) -30)/mean(x)
## [1] 0.886768

Data analytics perspective and estimation: data

# load the data
data(swiss)
# look at the description
?swiss

Data analytics perspective and estimation: research question

  • Do more years of schooling improve educational outcomes?
  • Approximate educational attainment with the variable Education and educational outcomes with the variable Examination.
  • Make use of the simple linear model to investigate whether more schooling improves educational outcomes (on average)?

Model specification

Our very simple model of how the average level of educational attainment in a district \(i\) is related to the average performance in the standardized test is defined as follows:

\(Examination_{i}= \alpha + \beta Education_{i}\).

Intuitively we would expect that \(\beta\) is positive, indicating that a higher share of draftees with more years of schooling results in a higher share of draftees who reach the highest examination mark.

To formally acknowledge that other factors might also play a role, we extend our model with the term \(u_{i}\). For the moment, we thus subsume all other potentially relevant factors in that term:

\(Examination_{i}= \alpha + \beta Education_{i} + u_{i}\).

Scatter-plot of the raw data

plot(swiss$Education, swiss$Examination)

Derivation and implementation of OLS estimator

From the model equation we easily see that these ‘differences’ between the predicted and the actual values of \(y\) are the remaining unexplained component \(u\):

\(y_{i}-\hat{\alpha}-\hat{\beta} x_i=u_i\).

Hence, we want to minimize the sum of squared residuals (SSR): \(\sum{u_i^2}=\sum{(y_{i}-\hat{\alpha}-\hat{\beta} x_i)^2}\). Using calculus, we define the two first order conditions:

\[\frac{\partial SSR}{\partial \hat{\alpha}}=\sum{-2(y_{i}-\hat{\alpha}-\hat{\beta} x_i)}=0\]

\[\frac{\partial SSR}{\partial \hat{\beta}}=\sum{-2x_i(y_{i}-\hat{\alpha}-\hat{\beta} x_i)}=0\]

Derivation and implementation of OLS estimator

The first condition is relatively easily solved by getting rid of the \(-2\) and considering that \(\sum{y_i}=N\bar{y}\): \(\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}\).

Derivation and implementation of OLS estimator

By plugging the solution for \(\hat{\alpha}\) into the first order condition regarding \(\hat{\beta}\) and again considering that \(\sum{y_i}=N\bar{y}\), we get the solution for the slope coefficient estimator:

\(\frac{\sum{x_{i}y_{i}}-N\bar{y}\bar{x}}{\sum{x_i^2}-N\bar{x}^2}\).

OLS estimator in R!

# implement the simple OLS estimator
# verify implementation with simulated data from above
# my_ols(y,x) 
# should be very close to alpha=30 and beta=0.9
my_ols <- 
  function(y,x) {
    N <- length(y)
    betahat <- (sum(y*x) - N*mean(x)*mean(y)) / (sum(x^2)-N*mean(x)^2)
    alphahat <- mean(y)-betahat*mean(x)
    
    return(list(alpha=alphahat,beta=betahat))
  }

# estimate effect of Education on Examination
estimates <- my_ols(swiss$Examination, swiss$Education)
estimates
## $alpha
## [1] 10.12748
## 
## $beta
## [1] 0.5794737

We can visually verify how well our own OLS-estimator works as follows:

plot(swiss$Education, swiss$Examination)
abline(estimates$alpha, estimates$beta, col="red")

Regression toolbox in R

Of course, R already provides various functions to run regression analyses. The most commonly used function for simple linear regressions is lm():

estimates2 <- lm(Examination~Education, data=swiss)
estimates2
## 
## Call:
## lm(formula = Examination ~ Education, data = swiss)
## 
## Coefficients:
## (Intercept)    Education  
##     10.1275       0.5795

With one additional line of code we can compute all the common statistics about the regression estimation:

summary(estimates2)
## 
## Call:
## lm(formula = Examination ~ Education, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9322  -4.7633  -0.1838   3.8907  12.4983 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.12748    1.28589   7.876 5.23e-10 ***
## Education    0.57947    0.08852   6.546 4.81e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.773 on 45 degrees of freedom
## Multiple R-squared:  0.4878, Adjusted R-squared:  0.4764 
## F-statistic: 42.85 on 1 and 45 DF,  p-value: 4.811e-08

References

Wickham, Hadley, and Garrett Grolemund. 2017. Sebastopol, CA: O’Reilly. http://r4ds.had.co.nz/.